Multithreaded processor capable of implicit multithreaded execution of a single-thread program

ABSTRACT

A device is presented including a first processor and a second processor. A number of memory devices are connected to the first processor and the second processor. A register buffer is connected to the first processor and the second processor. A trace buffer is connected to the first processor and the second processor. A number of memory instruction buffers are connected to the first processor and the second processor. The first processor and the second processor perform single threaded applications using multithreading resources. A method is also presented where a first thread is executed from a first processor. The first thread is also executed from a second processor as directed by the first processor. The second processor executes instructions ahead of the first processor.

BACKGROUND OF THE INVENTION

[0001] 1. Field of the Invention

[0002] This invention relates to multiprocessors, and more particularly to a method and apparatus for multithreaded execution of single-thread programs.

[0003] 2. Description of the Related Art

[0004] In many processing systems today, such as personal computers (PCs), single chip multiprocessors (CMP) play an important role in executing multithreaded programs. The threads that these processors may process and execute are independent of each other. For instance, threads may be derived from independent programs or from the same program. Some programs are compiled into threads that do not have dependencies between themselves. In a multi-threading environment, however, some single-thread applications may be too difficult to convert explicitly into multiple threads. Also, running existing single-thread binaries on a multi-threading processor does not exploit the multi-threading capability of the chip.

BRIEF DESCRIPTION OF THE DRAWINGS

[0005] The invention is illustrated by way of example and not by way of limitation in the figures of the accompanying drawings in which like references indicate similar elements. It should be noted that references to “an” or “one” embodiment in this disclosure are not necessarily to the same embodiment, and such references mean at least one.

[0006] FIG. 1 illustrates an embodiment of the invention.

[0007] FIG. 2 illustrates a commit processor of an embodiment of the invention.

[0008] FIG. 3 illustrates a speculative processor of an embodiment of the invention.

[0009] FIG. 4 illustrates a store-forwarding buffer of an embodiment of the invention.

[0010] FIG. 5 illustrates a load-ordering buffer of an embodiment of the invention.

[0011] FIG. 6 illustrates an embodiment of the invention having a system.

[0012] FIG. 7 illustrates a block diagram of an embodiment of the invention.

DETAILED DESCRIPTION OF THE INVENTION

[0013] The invention generally relates to an apparatus and method for multithreaded execution of single-thread programs. Referring to the figures, exemplary embodiments of the invention will now be described. The exemplary embodiments are provided to illustrate the invention and should not be construed as limiting the scope of the invention.

[0014] FIG. 1 illustrates one embodiment of the invention comprising multiprocessor 100. In one embodiment of the invention, multiprocessor 100 is a dual core single chip multiprocessor (CMP). Multiprocessor 100 further comprises commit central processing unit (CPU) 110, speculative CPU 120, register file buffer 130, trace buffer 140, load buffer 150 (also known as load ordering buffer), store buffer 160 (also known as store forwarding buffer), L1 cache 175, L2 cache 170, L0 instruction cache (I cache) 180, and L0 data cache (D cache) 190. In one embodiment of the invention L0 I cache 180 comprises two L0 I cache components. One of the L0 I cache 180 components is coupled to commit processor 110, and the other L0 I cache 180 component is coupled to speculative processor 120. In this embodiment of the invention, the two I cache components maintain duplicate information. In one embodiment of the invention, fetch requests are issued to L1 cache 175 from either of the L0 I cache 180 components. Lines fetched from L1 cache 175 are filled into L0 I cache 180 coupled to speculative processor 120 and commit processor 110. It should be noted that embodiments of the invention may contain any combination of cache memory hierarchy without diverging from the scope of the invention.

[0015] In one embodiment of the invention, L0 D cache 190 comprises two L0 D cache components. One of the L0 D cache 190 components is coupled to commit processor 110, and the other L0 D cache 190 component is coupled to speculative processor 120. In this embodiment of the invention, the two L0 D cache components maintain duplicate information. In this embodiment of the invention, store instructions/commands (stores) associated with speculative processor 120 are not written into L0 D cache 190. In this embodiment of the invention line read and write requests are issued to L1 cache 175 from either L0 D cache component. Lines fetched from L1 cache 175 are filled into L0 D cache 190 components coupled to commit processor 110 and speculative processor 120. Stores issued from commit processor 110 are written into the L0 D cache component coupled to speculative processor 120. By having exact copies of data in each L0 D cache component, internal snooping is not necessary.

[0016] In one embodiment of the invention, register file buffer 130 comprises an integer register file buffer and a predicate register file buffer. In one embodiment of the invention the integer register file buffer comprises a plurality of write ports, a plurality of checkpoints and at least one read port. The integer register file buffer is used to communicate register values from commit processor 110 to speculative processor 120. In one embodiment of the invention, the integer register file buffer comprises eight (8) write ports, four (4) checkpoints, and one (1) read port to access any of the checkpointed contexts. In one embodiment of the invention, the integer register file buffer has an eight (8) register wide array and sixteen (16) rows. In one embodiment of the invention, the predicate register file buffer comprises a plurality of write ports, a plurality of checkpoints and at least one read port. The predicate register file buffer is used to communicate register values from commit processor 110 to speculative processor 120 and a second level register file coupled to speculative processor 120. In one embodiment of the invention, the predicate register file buffer comprises eight (8) write ports, four (4) checkpoints, and one (1) read port to access any of the checkpointed contexts. In one embodiment of the invention, the predicate register file buffer has an eight (8) register wide array and eight (8) rows.

[0017] FIG. 2 illustrates commit CPU 110. In one embodiment of the invention, commit CPU 110 comprises decoder 211, scoreboard 214, register file 212, and execution units 213. Likewise, FIG. 3 illustrates speculative CPU 120. In one embodiment of the invention, speculative CPU 120 comprises decoder 321, scoreboard 324, register file 322, and execution units 323. L2 cache 170 and L1 cache 175 are shared by commit CPU 110 and speculative CPU 120. In one embodiment of the invention, multiprocessor 100 is capable of executing explicitly multithreaded programs. In another embodiment, multiprocessor 100 is capable of executing single-threaded applications while using a multi-thread environment without converting the single-threaded application into an explicitly multithreaded application.

[0018] In one embodiment of the invention, program execution begins as a single thread on one of commit CPU 110 and speculative CPU 120. In one embodiment of the invention, commit CPU 110 fetches, decodes, executes and updates register file 212, as well as issues load instructions/commands (loads) and stores to memory as instructed by the program. As the instructions are decoded, commit CPU 110 may direct speculative CPU 120 to start executing a speculative thread at some program counter value. This program counter value may be the address of the next instruction in memory, or it may be supplied as a hint by a compiler. For example, a fork at a next instruction address may be a thread forked at a call instruction. Speculative CPU 120 continues its thread execution until a program counter in commit CPU 110 reaches the same point in the program execution to which the speculative thread program counter points. Therefore, commit CPU 110 fetches, issues and commits every instruction in the program, even when an instruction belongs to a speculative thread.

[0019] In one embodiment of the invention, the dual execution architecture of multiprocessor 100 has a benefit wherein speculative CPU 120, executing farther in the program, provides highly efficient prefetch of instructions and data. Also, speculative CPU 120 determines the direction of many branches before the control flow of commit CPU 110 reaches these branches. In one embodiment of the invention, commit CPU 110 receives information on control flow direction from speculative CPU 120, and therefore, commit CPU 110 can avoid branch prediction for many branches and the associated misprediction penalty. In one embodiment of the invention, dependent and adjacent instructions executed correctly by the speculative thread can have the results concurrently committed in one commit cycle by commit CPU 110, saving time normally required to serially execute and propagate results between dependent instructions.

[0020] In one embodiment of the invention, input register values to the speculative thread are communicated through register file buffer 130. All values written into register file 212, of commit CPU 110, are also written into register file buffer 130. In one embodiment of the invention when the speculative thread is spawned, a snapshot of register file 212 is available in register file buffer 130, located between commit CPU 110 and speculative CPU 120. Initially, when a speculative thread is started, none of speculative CPU 120's registers have the input value stored in them. Input registers that are needed may be read on demand from register file buffer 130. In one embodiment of the invention, scoreboard 324 in speculative CPU 120's decode stage is used to keep track of which registers are loaded from register file buffer 130, or written by the speculative thread. Those registers are valid in register file 322. All other registers are read on demand from register file buffer 130.
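By way of illustration only, the on-demand register read described above can be sketched in Python. This is a minimal software model, not the disclosed hardware: the class name `SpeculativeRegisters`, the dictionary standing in for register file buffer 130, and the `buffer_reads` counter are all hypothetical names introduced for this sketch.

```python
class SpeculativeRegisters:
    """Toy model of on-demand register reads guarded by a scoreboard."""

    def __init__(self, register_file_buffer):
        # Hypothetical stand-in for register file buffer 130: a snapshot of
        # the commit CPU's registers taken when the thread was spawned.
        self.buffer = register_file_buffer
        self.local = {}        # stands in for register file 322
        self.valid = set()     # stands in for scoreboard 324
        self.buffer_reads = 0  # bookkeeping for the example only

    def read(self, reg):
        # A register that is not yet valid locally is fetched on demand.
        if reg not in self.valid:
            self.local[reg] = self.buffer[reg]
            self.valid.add(reg)
            self.buffer_reads += 1
        return self.local[reg]

    def write(self, reg, value):
        # A register written by the speculative thread becomes valid locally.
        self.local[reg] = value
        self.valid.add(reg)


regs = SpeculativeRegisters({"r1": 10, "r2": 20})
first = regs.read("r1")    # fetched from the buffer
second = regs.read("r1")   # served from the local register file
regs.write("r2", 99)       # overwritten before it was ever fetched
```

In this sketch a register is fetched from the buffer at most once, and a register that the speculative thread writes before reading never touches the buffer at all.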

[0021] In one embodiment of the invention, input memory values to the speculative thread are read from the coherent cache hierarchy, allowing the speculative thread to access memory modified by the commit thread. In one embodiment of the invention, a cache coherency scheme is used where L0 D cache 190 is a write through cache, and L2 cache 170 is a write back cache using a MESI (M: modified; E: exclusive; S: shared; I: invalid) cache coherency protocol. One should note, however, that other cache coherency protocols may also be used in other embodiments of the invention.

[0022] Depending on the data flow in a particular program, commit CPU 110 may produce some register or memory input values after these inputs are read by the speculative thread. In one embodiment of the invention, to relax the limitations imposed by register and memory data flow, value prediction is used to provide initial input values to the speculative thread. In one embodiment of the invention, a simple value prediction method is used having passive prediction. In this embodiment, it is assumed that register and memory input values have already been produced by commit CPU 110 at the time the speculative thread is spawned.

[0023] In one embodiment of the invention, speculative results are written into register file 322 of CPU 120 as well as trace buffer 140. In one embodiment of the invention, trace buffer 140 is a circular buffer implemented as an array with head and tail pointers. In one embodiment of the invention, the head and tail pointers have a wrap-around bit. In one embodiment of the invention, trace buffer 140 has an array with one read port and one write port. In this embodiment of the invention, each entry has enough bytes to store the results of a number of instructions at least equal in number to the issue width of commit CPU 110. In this embodiment of the invention, each entry has a bit per instruction, with a second write port used to mark mispredicted loads.

[0024] In one embodiment of the invention, trace buffer 140 has one hundred-and-twenty-eight (128) entries that can each store results for six (6) instructions. In one embodiment of the invention, trace buffer 140 has four (4) partitions to support four (4) threads. In one embodiment of the invention, trace buffer 140 accommodates sixteen (16) bytes for storing two outputs per instruction, four (4) bytes to store renamed registers, and one (1) bit to mark if an instruction is a mispredicted load. In one embodiment of the invention, the mispredicted load bit can be set by six (6) write ports from load buffer 150. In one embodiment of the invention, when a thread partition is full, speculative execution is continued to prefetch into L0 I cache 180 and L0 D cache 190, but results are not written into the trace buffer.
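For illustration, a circular buffer with head and tail pointers that each carry a wrap-around bit can be sketched in Python as follows. The sketch assumes a single partition and treats each entry as an opaque value; the class and method names (`TraceBuffer`, `push`, `pop`) are hypothetical. The wrap-around bit is what distinguishes a full buffer from an empty one when the two pointers hold the same index.

```python
class TraceBuffer:
    """Toy circular buffer; head/tail pointers each carry a wrap-around bit."""

    def __init__(self, entries=128):
        self.entries = entries
        self.slots = [None] * entries
        self.head, self.head_wrap = 0, 0  # next entry to commit
        self.tail, self.tail_wrap = 0, 0  # next entry to fill

    def is_empty(self):
        # Same index and same wrap bit: the consumer has caught up.
        return self.head == self.tail and self.head_wrap == self.tail_wrap

    def is_full(self):
        # Same index but different wrap bit: the producer is one lap ahead.
        return self.head == self.tail and self.head_wrap != self.tail_wrap

    def _advance(self, index, wrap):
        index += 1
        if index == self.entries:
            return 0, wrap ^ 1  # toggle the wrap-around bit on overflow
        return index, wrap

    def push(self, results):
        """Speculative side. Returns False (result dropped) when full."""
        if self.is_full():
            return False
        self.slots[self.tail] = results
        self.tail, self.tail_wrap = self._advance(self.tail, self.tail_wrap)
        return True

    def pop(self):
        """Commit side. Returns None when there is nothing to commit."""
        if self.is_empty():
            return None
        results = self.slots[self.head]
        self.head, self.head_wrap = self._advance(self.head, self.head_wrap)
        return results
```

Returning `False` from `push` on a full buffer loosely mirrors the behavior described above, in which speculative execution continues for its prefetch effect while results are no longer recorded.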

[0025] In one embodiment of the invention commit CPU 110 has scoreboard 214 that comprises one bit per register. In this embodiment of the invention, any modification of a register by commit CPU 110 between the fork point and the join point of a speculative thread causes the register scoreboard bit to be set. As commit CPU 110 retires the speculative thread results, it continuously keeps track in scoreboard 214 of all registers that are mispredicted. In this embodiment of the invention, instructions whose source register scoreboard bits are clear are safely committed into register file 212. Such instructions, even if dependent, do not have to be executed. There are some exceptions, however, such as loads and stores. Load and store exceptions have to be issued to memory execution units 213 to service cache misses and to check for memory ordering violations. Results of branch execution are also sent from speculative CPU 120 to commit CPU 110. Branch prediction in commit CPU 110 can be bypassed for some or all of the branches executed by speculative CPU 120.

[0026] In one embodiment of the invention loads and stores associated with commit processor 110 snoop load buffer 150. In one embodiment of the invention, when an instruction is replayed or if an instruction is a mispredicted load, the instruction's associated destination register bit is set in scoreboard 214. When the instruction is clean, its destination register bit is cleared in scoreboard 214. Note that an instruction is clean when its sources are clean. Scoreboard 214 is cleared when all speculative thread instructions are committed.
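The scoreboard rule described above (an instruction is clean when its sources are clean; a replayed instruction or mispredicted load marks its destination) can be sketched in Python. This is a simplified model under stated assumptions: instructions are hypothetical `(dest, sources, mispredicted_load)` tuples, the set `dirty_at_join` stands for the registers the commit CPU modified between the fork point and the join point, and loads and stores are not treated specially.

```python
def classify(instructions, dirty_at_join):
    """Decide, per instruction, between committing the speculative result
    and replaying the instruction on the commit CPU.

    instructions: list of (dest, sources, mispredicted_load) tuples
                  (a hypothetical encoding used only in this sketch).
    dirty_at_join: registers whose scoreboard bit is set at the join point.
    """
    dirty = set(dirty_at_join)  # stands in for scoreboard 214
    decisions = []
    for dest, sources, mispredicted_load in instructions:
        replay = mispredicted_load or any(src in dirty for src in sources)
        if replay:
            dirty.add(dest)      # result may differ: mark the destination
        else:
            dirty.discard(dest)  # clean result overwrites a stale value
        decisions.append("replay" if replay else "commit")
    return decisions


trace = [
    ("r3", ["r1", "r2"], False),  # reads dirty r1
    ("r4", ["r3"], False),        # depends on the replayed r3
    ("r1", ["r5"], False),        # clean; also cleans r1
    ("r6", ["r1"], False),        # r1 is clean again
    ("r7", ["r5"], True),         # mispredicted load
]
decisions = classify(trace, dirty_at_join={"r1"})
# decisions == ["replay", "replay", "commit", "commit", "replay"]
```

The third instruction shows why the bit is cleared for clean instructions: once a clean result overwrites a register, later readers of that register no longer need to be replayed.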

[0027] In one embodiment of the invention, speculative CPU 120 does not issue store instructions to memory. In this embodiment of the invention, store instructions are posted in store buffer 160 and load instructions are posted in load buffer 150. In one embodiment of the invention, store buffer 160 is a fully associative store forwarding buffer. FIG. 4 illustrates the structure of store buffer 160 in one embodiment of the invention. In store buffer 160 (illustrated in FIG. 4) each entry 410 comprises tag portion 420, valid portion 430, data portion 440, store identification (ID) 450 and thread ID portion 460. In one embodiment of the invention data portion 440 accommodates eight (8) bytes of data. In one embodiment of the invention valid portion 430 accommodates eight (8) bits. Store ID 450 is a unique store instruction ID of the last store instruction to write into an entry 410. In one embodiment of the invention, speculative loads access store buffer 160 concurrently with L0 D cache 190 access. If the load hits a store instruction in store buffer 160, L0 D cache 190 is bypassed and a load is read from store buffer 160. In this case, store ID 450 is also read out with the data.
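A minimal Python sketch of store forwarding with the entry layout described above is given below. Several details are assumptions made for the sketch rather than statements of the disclosure: the tag is taken to be the address divided by eight, bytes are placed at their address offset within the entry, the eight valid bits are interpreted as one bit per data byte, a load that is only partially covered falls back to the cache, and the cache is a caller-supplied function. The names `StoreEntry`, `StoreBuffer`, `store` and `load` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StoreEntry:
    tag: int          # tag portion 420 (assumed: address // 8)
    valid: list       # valid portion 430 (assumed: one bit per data byte)
    data: bytearray   # data portion 440, eight bytes
    store_id: int     # store ID 450: last store to write this entry
    thread_id: int    # thread ID portion 460


class StoreBuffer:
    """Toy fully associative store forwarding buffer (no capacity limit)."""

    def __init__(self):
        self.entries = {}

    def store(self, addr, payload, store_id, thread_id):
        tag, offset = divmod(addr, 8)
        if offset + len(payload) > 8:
            raise ValueError("sketch does not model stores crossing an entry")
        entry = self.entries.get(tag)
        if entry is None:
            entry = StoreEntry(tag, [False] * 8, bytearray(8), store_id, thread_id)
            self.entries[tag] = entry
        for i, byte in enumerate(payload):
            entry.data[offset + i] = byte
            entry.valid[offset + i] = True
        entry.store_id = store_id
        entry.thread_id = thread_id

    def load(self, addr, size, cache):
        """Return (data, forwarding_store_id); the ID is None on a miss."""
        tag, offset = divmod(addr, 8)
        entry = self.entries.get(tag)
        if entry is not None and all(entry.valid[offset:offset + size]):
            # Hit: the cache is bypassed and the store ID is read with the data.
            return bytes(entry.data[offset:offset + size]), entry.store_id
        return cache(addr, size), None
```

The store ID returned on a hit is the value a load-ordering structure would later need in order to check whether the forwarding store was itself correct.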

[0028] In one embodiment of the invention, load data can be obtained by speculative processor 120 from either store buffer 160 or L0 D cache 190 associated with speculative processor 120. In one embodiment of the invention, loads are posted into load buffer 150. In this embodiment of the invention, when a load is posted, a mispredicted load bit is set in trace buffer 140 in case of load buffer 150 overflow.

[0029] In one embodiment of the invention store buffer 160 has one hundred-and-twenty-eight (128) entries, where the entries are four (4) way set associative. In one embodiment of the invention, store buffer 160 has two (2) store and two (2) load ports. In one embodiment of the invention store buffer 160 allows a partial tag match using virtual addresses for forwarding, and a full physical tag match to validate forwarding store IDs. In one embodiment of the invention store buffer 160 stores data written in data portion 440 starting from the first byte to avoid alignment delay. In one embodiment of the invention store buffer 160 has a replacement policy that replaces the oldest store upon a store miss, otherwise it replaces a hit entry. In one embodiment of the invention thread ID 460 is an index to a partition in trace buffer 140, and has a wrap around bit. In one embodiment of the invention, a global reset of thread entries is performed by using a thread ID content addressable memory (CAM) port (not shown).

[0030] In one embodiment of the invention, speculative loads are posted in load buffer 150. In one embodiment of the invention, load buffer 150 is a set associative load buffer coupled to commit CPU 110. FIG. 5 illustrates the structure of load buffer 150. In load buffer 150 (illustrated in FIG. 5) each entry 510 comprises a tag portion 520, an entry valid bit portion 530, load ID 540, and load thread ID 550. In one embodiment of the invention, tag portion 520 comprises a partial address tag. In another embodiment, each entry 510 additionally has a store thread ID, a store ID, and a store valid bit (not shown). The store ID is the ID of the forwarding store instruction if the load instruction has hit the store buffer 160.

[0031] In one embodiment of the invention the store ID and/or load ID 540 is an index into an entry in trace buffer 140, which is unique per instruction. In one embodiment of the invention the store valid bit is set to zero (“0”) if a load hits store buffer 160. In this embodiment of the invention, the store valid bit is set to one (“1”) if the load missed store buffer 160. In one embodiment of the invention, a replayed store that has a matching store ID clears (sets to “0”) the store valid bit and sets the mispredicted bit in the load entry in trace buffer 140. In one embodiment of the invention, a later store in the program that matches tag portion 520 clears (sets to “0”) the store valid bit and sets the mispredicted bit in the load entry in trace buffer 140. In one embodiment of the invention, a clean (not replayed) store that matches the store ID sets the store valid bit to one (“1”). In one embodiment of the invention, upon a clean (not replayed) load not matching any tag 520, or a load matching tag 520 with the store valid bit clear (set to “0”), the pipeline is flushed, the mispredicted bit in the load entry in trace buffer 140 is set to one (“1”), and the load instruction is restarted. In one embodiment of the invention, when a load entry is retired, entry valid bit portion 530 is cleared.
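The store valid bit rules above can be collected into a small Python state model. This is one possible reading, offered as a sketch only: it assumes that a store whose ID matches an entry is the forwarding store for that entry, that any other store matching the tag is a "later store", and that a committing load is matched on both tag and load ID. The names `LoadEntry`, `LoadBuffer` and the `mispredicted` set (standing in for the mispredicted bits in trace buffer 140) are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LoadEntry:
    tag: int
    load_id: int
    store_id: Optional[int]  # forwarding store, or None if the load missed
    store_valid: bool
    entry_valid: bool = True


class LoadBuffer:
    """Toy load-ordering buffer tracking the store valid bit per entry."""

    def __init__(self):
        self.entries = []
        self.mispredicted = set()  # load IDs flagged in the trace buffer

    def post_speculative_load(self, tag, load_id, forwarding_store_id=None):
        hit = forwarding_store_id is not None
        # Store valid is 0 on a store buffer hit and 1 on a miss.
        self.entries.append(
            LoadEntry(tag, load_id, forwarding_store_id, store_valid=not hit))

    def commit_store(self, tag, store_id, replayed):
        for e in self.entries:
            if not e.entry_valid:
                continue
            if e.store_id is not None and e.store_id == store_id:
                if replayed:
                    e.store_valid = False          # forwarded data is suspect
                    self.mispredicted.add(e.load_id)
                else:
                    e.store_valid = True           # forwarding confirmed
            elif e.tag == tag:
                e.store_valid = False              # a later store hit the entry
                self.mispredicted.add(e.load_id)

    def commit_clean_load(self, tag, load_id):
        """Return True to commit; False means flush and restart the load."""
        for e in self.entries:
            if e.entry_valid and e.tag == tag and e.load_id == load_id:
                if e.store_valid:
                    return True
                break
        self.mispredicted.add(load_id)
        return False

    def retire(self, load_id):
        for e in self.entries:
            if e.load_id == load_id:
                e.entry_valid = False              # clear entry valid bit
```

Under this reading, a load that was forwarded from the store buffer can only commit after the commit CPU has confirmed the forwarding store as clean, which is what sets the store valid bit back to one.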

[0032] In one embodiment of the invention, load buffer 150 has sixty-four (64) entries that are four (4) way set associative. In one embodiment of the invention, load buffer 150 has a policy that replaces an oldest load. In one embodiment of the invention a global reset of thread entries is performed by using a thread ID CAM port (not shown).

[0033] In one embodiment of the invention, commit CPU 110 issues all loads and stores to memory execution units 213 (address generation unit, load buffer, data cache), including loads that were correctly executed by speculative processor 120. Valid load data with potentially dependent instructions could be committed, even when a load instruction issued by commit processor 110 misses L0 D cache 190. In one embodiment of the invention, a load miss request is sent to L2 cache 170 to fill the line, but the return data is prevented from writing to register file 212. In one embodiment of the invention, every load instruction accesses load buffer 150. A load miss of load buffer 150 causes a pipeline flush and a restart of the load instruction and all instructions that follow it.

[0034] In one embodiment of the invention, stores also access load buffer 150. In one embodiment of the invention, when an address-matching store also matches store ID 540, validity bit 530 is set in an entry 510. In this embodiment of the invention, a later store that hits an entry 510 invalidates the entry 510. In this embodiment of the invention when a store invalidates an entry 510, a load ID 540 is used to index trace buffer 140 to set the mispredicted load bit. In this embodiment of the invention when a load is fetched and the mispredicted load bit in trace buffer 140 is found to be set, a register bit is set in scoreboard 214. This register scoreboard bit may also be called the load destination scoreboard bit. In this embodiment of the invention, this optimization reduces the number of flushes that occur as the result of load misses in load buffer 150. One should note that commit CPU 110 concurrently reads trace buffer 140 and L0 I cache 180. In this embodiment of the invention, this concurrent read of trace buffer 140 and L0 I cache 180 enables setting a scoreboard register bit in scoreboard 214 for a mispredicted load instruction in time without having to stall the execution pipeline.

[0035] In one embodiment of the invention “replay mode” execution starts at the first instruction of a speculative thread. When a partition in trace buffer 140 is becoming empty, replay mode as well as speculative thread execution are terminated. In one embodiment of the invention, instruction issue and register rename stages are modified as follows: no register renaming is performed, since trace buffer 140 supplies names; all instructions up to the next replayed instruction, including dependent instructions, are issued; clean (not replayed) instructions are issued as no-operation (NOP) instructions; all loads and stores are issued to memory; and clean instruction results are committed from trace buffer 140 to register file 212.

[0036] FIG. 6 illustrates a system having an embodiment of the invention. System 600 comprises multiprocessor 100 (see FIG. 1), main memory 610, north bridge 620, hub link 630, and south bridge 640. Typically, the chief responsibility of north bridge 620 is the multiprocessor interface. In addition, north bridge 620 may also have controllers for an accelerated graphics port (AGP), memory 610, and hub link 630, among others. South bridge 640 is typically responsible for a hard drive controller, a universal serial bus (USB) host controller, an input/output (I/O) controller, and any integrated sound devices, amongst others. In one embodiment of the invention, multiprocessor 100 contains embodiments of the invention described above.

[0037] FIG. 7 illustrates a process for an embodiment of the invention. Process 700 begins with block 710, which starts the execution of a program thread by a first processor, such as commit processor 110. Block 720 performs fetching of commands by the first processor. Block 730 performs decoding of commands by the first processor. Block 740 instructs a second processor, such as speculative processor 120, to begin program execution of the same thread as the first processor, but at a location further in the program stream. Block 750 begins execution of the program thread by the second processor. In block 751, the second processor fetches commands. In block 752, the second processor performs decoding.

[0038] In block 753, the second processor updates a register file. In block 754, the second processor transmits control flow information to the first processor. In block 760, the first processor updates a register file. Block 770 determines whether the first processor has reached the same point of execution as the second processor. If block 770 determines that the first processor has not yet reached the same point in the program, process 700 continues with block 780 to continue execution. If block 770 determines that the first processor has reached the same point in the execution as the second processor, block 790 determines if the program is complete. If block 790 determines that the program is complete, process 700 stops; otherwise, process 700 continues at A.

[0039] With the use of embodiments of the invention discussed above, performance can be increased when executing single-threaded applications as a result of the speculative long-range multithreaded pre-fetch and pre-execution. The embodiments of the invention can be implemented with in-order and out-of-order multithreaded processors.

[0040] The above embodiments can also be stored on a device or machine-readable medium and be read by a machine to perform instructions. The machine-readable medium includes any mechanism that provides (i.e., stores and/or transmits) information in a form readable by a machine (e.g., a computer). For example, a machine-readable medium includes read only memory (ROM); random access memory (RAM); magnetic disk storage media; optical storage media; flash memory devices; electrical, optical, acoustical or other form of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.). The device or machine-readable medium may include a solid state memory device and/or a rotating magnetic or optical disk. The device or machine-readable medium may be distributed when partitions of instructions have been separated into different machines, such as across an interconnection of computers.

[0041] While certain exemplary embodiments have been described and shown in the accompanying drawings, it is to be understood that such embodiments are merely illustrative of and not restrictive on the broad invention, and that this invention not be limited to the specific constructions and arrangements shown and described, since various other modifications may occur to those ordinarily skilled in the art.

What is claimed is:
 1. A apparatus comprising: a first processor and asecond processor; a plurality of memory devices coupled to the firstprocessor and the second processor; a register buffer coupled to thefirst processor and the second processor; a trace buffer coupled to thefirst processor and the second processor; and a plurality of memoryinstruction buffers coupled to the first processor and the secondprocessor; wherein the first processor and the second processor performsingle threaded applications using multithreading resources.
 2. Theapparatus of claim 1, wherein the memory devices comprise of a pluralityof cache devices.
 3. The apparatus of claim 1, wherein the firstprocessor is coupled to at least one of a plurality of zero level (L0)data cache devices and at least one of a plurality of L0 instructioncache devices, and the second processor is coupled to at least one ofthe plurality of L0 data cache devices and at least one of the pluralityof L0 instruction cache devices.
 4. The apparatus of claim 3, whereineach of the plurality of L0 data cache devices having exact copies ofdata cache instructions, and each of the plurality of L0 instructioncache devices having exact copies of instruction cache instructions. 5.The apparatus of claim 1, wherein the plurality of memory instructionbuffers includes at least one store forwarding buffer and at least oneload-ordering buffer.
 6. The apparatus of claim 5, the at least onestore forwarding buffer comprising a structure having a plurality ofentries, each of the plurality of entries having a tag portion, avalidity portion, a data portion, a store instruction identification(ID) portion, and a thread ID portion.
 7. The apparatus of claim 6, theat least one load ordering buffer comprising a structure having aplurality of entries, each of the plurality of entries having a tagportion, an entry validity portion, a load identification (ID) portion,and a load thread ID portion.
 8. The apparatus of claim 7, each of theplurality of entries further having a store thread ID portion, a storeinstruction ID portion, and a store instruction validity portion.
 9. Theapparatus of claim 1, the trace buffer is a circular buffer having anarray with head and tail pointers, the head and tail pointers having awrap-around bit.
 10. The apparatus of claim 1, the register buffercomprising an integer register buffer and a predicate register buffer.11. A method comprising: executing a plurality of instructions in afirst thread by a first processor; and executing the plurality ofinstructions in the first thread by a second processor as directed bythe first processor, the second processor executing the plurality ofinstructions ahead of the first processor.
 12. The method of claim 11,further including: transmitting control flow information from the secondprocessor to the first processor, the first processor avoiding branchprediction by receiving the control flow information; and transmittingresults from the second processor to the first processor, the firstprocessor avoiding executing a portion of instructions by committing theresults of the portion of instructions into a register file from a tracebuffer.
 13. The method of claim 12, further including: duplicatingmemory information in separate memory devices for independent access bythe first processor and the second processor.
 14. The method of claim12, further including: clearing a store validity bit and setting amispredicted bit in a load entry in the trace buffer if a replayed storeinstruction has a matching store identification (ID) portion.
 15. Themethod of claim 12, further including: setting a store validity bit if astore instruction that is not replayed matches a store identification(ID) portion.
 16. The method of claim 12, further including: flushing apipeline, setting a mispredicted bit in a load entry in the trace bufferand restarting a load instruction if one of the load is not replayed anddoes not match a tag portion in a load buffer, and the load instructionmatches the tag portion in the load buffer while a store valid bit isnot set.
 17. The method of claim 12, further including: executing areplay mode at a first instruction of a speculative thread; terminatingthe replay mode and the execution of the speculative thread if apartition in the trace buffer is approaching an empty state.
 18. Themethod of claim 12, further including: supplying names from the tracebuffer to preclude register renaming; issuing all instructions up to anext replayed instruction including dependent instructions; issuinginstructions that are not replayed as no-operation (NOPs) instructions;issuing all load instructions and store instructions to memory;committing non-replayed instructions from the trace buffer to theregister file.
 19. The method of claim 12, further including: clearing avalid bit in an entry in a load buffer if the load entry is retired. 20.An apparatus comprising a machine-readable medium containinginstructions which, when executed by a machine, cause the machine toperform operations comprising: executing a first thread from a firstprocessor; and executing the first thread from a second processor asdirected by the first processor, the second processor executinginstructions ahead of the first processor.
 21. The apparatus of claim20, further containing instructions which, when executed by a machine,cause the machine to perform operations including: transmitting controlflow information from the second processor to the first processor, thefirst processor avoiding branch prediction by receiving the control flowinformation.
 22. The apparatus of claim 21, further containinginstructions which, when executed by a machine, cause the machine toperform operations including: duplicating memory information in separatememory devices for independent access by the first processor and thesecond processor.
 23. The apparatus of claim 21, further containing instructions which, when executed by a machine, cause the machine to perform operations including: clearing a store validity bit and setting a mispredicted bit in a load entry in the trace buffer if a replayed store instruction has a matching store identification (ID) portion.
 24. The apparatus of claim 21, further containing instructions which, when executed by a machine, cause the machine to perform operations including: setting a store validity bit if a store instruction that is not replayed matches a store identification (ID) portion.
 25. The apparatus of claim 21, further containing instructions which, when executed by a machine, cause the machine to perform operations including: flushing a pipeline, setting a mispredicted bit in a load entry in the trace buffer and restarting a load instruction if one of the load is not replayed and does not match a tag portion in a load buffer, and the load instruction matches the tag portion in the load buffer while a store valid bit is not set.
 26. The apparatus of claim 21, further containing instructions which, when executed by a machine, cause the machine to perform operations including: executing a replay mode at a first instruction of a speculative thread; terminating the replay mode and the execution of the speculative thread if a partition in the trace buffer is approaching an empty state.
 27. The apparatus of claim 21, further containing instructions which, when executed by a machine, cause the machine to perform operations including: supplying names from the trace buffer to preclude register renaming; issuing all instructions up to a next replayed instruction including dependent instructions; issuing instructions that are not replayed as no-operation (NOPs) instructions; issuing all load instructions and store instructions to memory; committing non-replayed instructions from the trace buffer to the register file.
 28. The apparatus of claim 21, further containing instructions which, when executed by a machine, cause the machine to perform operations including: clearing a valid bit in an entry in a load buffer if the load entry is retired.
 29. A system comprising: a first processor; a second processor; a bus coupled to the first processor and the second processor; a main memory coupled to the bus; a plurality of local memory devices coupled to the first processor and the second processor; a register buffer coupled to the first processor and the second processor; a trace buffer coupled to the first processor and the second processor; and a plurality of memory instruction buffers coupled to the first processor and the second processor, wherein the first processor and the second processor perform single threaded applications using multithreading resources.
 30. The system of claim 29, the local memory devices comprise a plurality of cache devices.
 31. The system of claim 30, the first processor is coupled to at least one of a plurality of zero level (L0) data cache devices and at least one of a plurality of L0 instruction cache devices, and the second processor is coupled to at least one of the plurality of L0 data cache devices and at least one of the plurality of L0 instruction cache devices.
 32. The system of claim 31, wherein each of the plurality of L0 data cache devices having exact copies of data cache instructions, and each of the plurality of L0 instruction cache devices having exact copies of instruction cache instructions.
 33. The system of claim 31, the first processor and the second processor each sharing a first level (L1) cache device and a second level (L2) cache device.
 34. The system of claim 29, wherein the plurality of memory instruction buffers includes at least one store forwarding buffer and at least one load ordering buffer.
 35. The system of claim 34, the at least one store forwarding buffer including a structure having a plurality of entries, each of the plurality of entries having a tag portion, a validity portion, a data portion, a store instruction identification (ID) portion, and a thread ID portion.
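
By way of illustration only, and not as part of any claim, the entry layout recited for the store forwarding buffer can be sketched in software. The five fields below mirror the recited portions (tag, validity, data, store instruction ID, thread ID). Everything else is an assumption made for this sketch: the Python representation, the field types, the hypothetical helper name forward_lookup, and in particular the policy of treating the matching entry with the largest store ID as the youngest store.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StoreForwardingEntry:
    """One buffer entry; field names follow the recited portions, widths are unspecified."""
    tag: int        # tag portion (assumed to identify the store address)
    valid: bool     # validity portion
    data: int       # data portion (value written by the store)
    store_id: int   # store instruction identification (ID) portion
    thread_id: int  # thread ID portion


def forward_lookup(entries: List[StoreForwardingEntry],
                   tag: int, thread_id: int) -> Optional[int]:
    """Hypothetical lookup: data of the valid matching entry with the largest store ID.

    Assumption for this sketch: a larger store_id means a younger store.
    Returns None when no valid entry matches both the tag and the thread ID.
    """
    best: Optional[StoreForwardingEntry] = None
    for entry in entries:
        if entry.valid and entry.tag == tag and entry.thread_id == thread_id:
            if best is None or entry.store_id > best.store_id:
                best = entry
    return None if best is None else best.data
```

In this sketch an invalid entry is never forwarded, and entries belonging to a different thread ID are ignored even when the tag matches; an actual hardware buffer would perform the comparison associatively rather than by a linear scan.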